Tutorial https://www.kaggle.com/roshansharma/online-shopper-s-intention
DataSet https://archive.ics.uci.edu/ml/datasets/Online+Shoppers+Purchasing+Intention+Dataset
Sources:
C. Okan Sakar:
Department of Computer Engineering, Faculty of Engineering and Natural Sciences, Bahcesehir University, 34349 Besiktas, Istanbul, Turkey
Yomi Kastro:
Inveon Information Technologies Consultancy and Trade, 34335 Istanbul, Turkey
1. Data
1.1 DataExploration/ExplainFeatures
1.1.1 DataSetInfo
1.1.2 DataSetDescribe
1.1.3 DataSetProfiling
1.1.4 DataSetPairPlot
1.2 Exploration (NULL)
1.3 Exploration (Classes)
1.4 Exploration and Preparation (Visualisation)
1.4.1 DataPreparation and Transformation
1.4.2 Exploration and Hypothesis Checks (Visualisation)
2. Classification
2.1 Preparation
2.1.1 CleanDataFrame
2.1.2 Splitting the Dataset
2.2 Model Selection
2.3 Measure and Fit Standard Models
2.3.1 Multilayer ROC-Curve
3. Optimized Classification
3.1 CrossValidation
3.2 KFold
3.3 RandomSearch+CrossValidation
3.3.1 BestScoredModel
3.3.2 ScoredModelList
3.3.3 Sorted ScoredModelList
3.3.4 Visualisation
3.4 ModelExplanation
3.4.1 Important Features
4. Clustering
4.1 Feature Selection
4.2 Train and Visualisation
9. TEST SECTION
import pandas as pd
import pandas_profiling
import numpy as np
import seaborn as sns
import time
import datetime
import matplotlib.pyplot as plt
plt.style.use('classic')
%matplotlib inline
%%time
df = pd.read_csv("data/eCommerce/online_shoppers_intention.csv")
Wall time: 21.9 ms
df
| Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay | Month | OperatingSystems | Browser | Region | TrafficType | VisitorType | Weekend | Revenue | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 0 | 0.0 | 1 | 0.000000 | 0.200000 | 0.200000 | 0.000000 | 0.0 | Feb | 1 | 1 | 1 | 1 | Returning_Visitor | False | False |
| 1 | 0 | 0.0 | 0 | 0.0 | 2 | 64.000000 | 0.000000 | 0.100000 | 0.000000 | 0.0 | Feb | 2 | 2 | 1 | 2 | Returning_Visitor | False | False |
| 2 | 0 | 0.0 | 0 | 0.0 | 1 | 0.000000 | 0.200000 | 0.200000 | 0.000000 | 0.0 | Feb | 4 | 1 | 9 | 3 | Returning_Visitor | False | False |
| 3 | 0 | 0.0 | 0 | 0.0 | 2 | 2.666667 | 0.050000 | 0.140000 | 0.000000 | 0.0 | Feb | 3 | 2 | 2 | 4 | Returning_Visitor | False | False |
| 4 | 0 | 0.0 | 0 | 0.0 | 10 | 627.500000 | 0.020000 | 0.050000 | 0.000000 | 0.0 | Feb | 3 | 3 | 1 | 4 | Returning_Visitor | True | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 12325 | 3 | 145.0 | 0 | 0.0 | 53 | 1783.791667 | 0.007143 | 0.029031 | 12.241717 | 0.0 | Dec | 4 | 6 | 1 | 1 | Returning_Visitor | True | False |
| 12326 | 0 | 0.0 | 0 | 0.0 | 5 | 465.750000 | 0.000000 | 0.021333 | 0.000000 | 0.0 | Nov | 3 | 2 | 1 | 8 | Returning_Visitor | True | False |
| 12327 | 0 | 0.0 | 0 | 0.0 | 6 | 184.250000 | 0.083333 | 0.086667 | 0.000000 | 0.0 | Nov | 3 | 2 | 1 | 13 | Returning_Visitor | True | False |
| 12328 | 4 | 75.0 | 0 | 0.0 | 15 | 346.000000 | 0.000000 | 0.021053 | 0.000000 | 0.0 | Nov | 2 | 2 | 3 | 11 | Returning_Visitor | False | False |
| 12329 | 0 | 0.0 | 0 | 0.0 | 3 | 21.250000 | 0.000000 | 0.066667 | 0.000000 | 0.0 | Nov | 3 | 2 | 1 | 2 | New_Visitor | True | False |
12330 rows × 18 columns
"Administrative", "Administrative Duration", "Informational", "Informational Duration", "Product Related" and "Product Related Duration" represent the number of pages of each type the visitor viewed during the session, as well as the total time spent in each of these page categories. The values of these features are derived from the URL information of the pages visited by the user and are updated in real time whenever the user takes an action, e.g. moving from one page to another.
| Type | URL (example) |
|---|---|
| Administrative | /?login |
| Administrative | /?logout |
| Administrative | /LoginRegister |
| Administrative | /passwordrecovery |
| .... | .... |
| Product Related | / |
| Product Related | /search |
| Product Related | /cart |
| .... | .... |
| Informational | /stores |
| Informational | /Catalog |
| .... | .... |
The features "Bounce Rate", "Exit Rate" and "Page Value" represent the metrics measured by Google Analytics for each page of the e-commerce site.
The "Bounce Rate" of a web page is the percentage of visitors who enter the site on that page and then leave ("bounce") without triggering any further requests to the analytics server during that session.
The "Exit Rate" of a specific web page is, over all pageviews of that page, the percentage that were the last pageview of the session.
The "Page Value" feature represents the average value of a web page that a user visited before completing an e-commerce transaction.
The "Special Day" feature indicates the closeness of the visit to a specific special day (e.g. Mother's Day, Valentine's Day), around which sessions are more likely to end with a transaction. The value of this attribute is determined by taking the dynamics of e-commerce into account, such as the time between the order date and the delivery date. For Valentine's Day, for example, this value is nonzero between February 2 and February 12, zero before and after this period unless the date is close to another special day, and reaches its maximum of 1 on February 8. The dataset also includes operating system, browser, region, traffic type, visitor type (returning or new visitor), a Boolean indicating whether the visit took place on a weekend, and the month of the year.
"TrafficType" is the channel through which the visitor reached the website (e.g. banner, SMS, direct).
"Month" and "VisitorType" are self-explanatory categorical, discrete values.
"OperatingSystems", "Browser" and "Region" are categorical values encoded as numbers.
"Revenue" and "Weekend" are self-explanatory Boolean values: whether revenue was generated, and whether the visit took place on a weekend.
The features and their data types are shown below.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12330 entries, 0 to 12329
Data columns (total 18 columns):
Administrative             12330 non-null int64
Administrative_Duration    12330 non-null float64
Informational              12330 non-null int64
Informational_Duration     12330 non-null float64
ProductRelated             12330 non-null int64
ProductRelated_Duration    12330 non-null float64
BounceRates                12330 non-null float64
ExitRates                  12330 non-null float64
PageValues                 12330 non-null float64
SpecialDay                 12330 non-null float64
Month                      12330 non-null object
OperatingSystems           12330 non-null int64
Browser                    12330 non-null int64
Region                     12330 non-null int64
TrafficType                12330 non-null int64
VisitorType                12330 non-null object
Weekend                    12330 non-null bool
Revenue                    12330 non-null bool
dtypes: bool(2), float64(7), int64(7), object(2)
memory usage: 1.5+ MB
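The report above shows about 1.5 MB of memory usage, with Month and VisitorType stored as object. As a side note not from the original notebook: object columns with few distinct values can be stored as the pandas category dtype to save memory. A minimal sketch with a toy column standing in for df['Month'] (the real CSV is not reloaded here):

```python
import pandas as pd

# Toy object column standing in for df['Month'] (hypothetical data, same idea)
month = pd.Series(["Feb", "Mar", "Nov", "Dec"] * 1000)

# Category dtype stores each distinct string once plus small integer codes
as_cat = month.astype("category")
print(month.memory_usage(deep=True) > as_cat.memory_usage(deep=True))  # True
```

In the notebook this would be `df['Month'] = df['Month'].astype('category')`; it would not change any of the analyses below.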
A first look: count, mean, standard deviation, quantiles (0.25, 0.5, 0.75), and min and max.
df.describe()
| Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay | OperatingSystems | Browser | Region | TrafficType | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 12330.000000 | 12330.000000 | 12330.000000 | 12330.000000 | 12330.000000 | 12330.000000 | 12330.000000 | 12330.000000 | 12330.000000 | 12330.000000 | 12330.000000 | 12330.000000 | 12330.000000 | 12330.000000 |
| mean | 2.315166 | 80.818611 | 0.503569 | 34.472398 | 31.731468 | 1194.746220 | 0.022191 | 0.043073 | 5.889258 | 0.061427 | 2.124006 | 2.357097 | 3.147364 | 4.069586 |
| std | 3.321784 | 176.779107 | 1.270156 | 140.749294 | 44.475503 | 1913.669288 | 0.048488 | 0.048597 | 18.568437 | 0.198917 | 0.911325 | 1.717277 | 2.401591 | 4.025169 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 7.000000 | 184.137500 | 0.000000 | 0.014286 | 0.000000 | 0.000000 | 2.000000 | 2.000000 | 1.000000 | 2.000000 |
| 50% | 1.000000 | 7.500000 | 0.000000 | 0.000000 | 18.000000 | 598.936905 | 0.003112 | 0.025156 | 0.000000 | 0.000000 | 2.000000 | 2.000000 | 3.000000 | 2.000000 |
| 75% | 4.000000 | 93.256250 | 0.000000 | 0.000000 | 38.000000 | 1464.157213 | 0.016813 | 0.050000 | 0.000000 | 0.000000 | 3.000000 | 2.000000 | 4.000000 | 4.000000 |
| max | 27.000000 | 3398.750000 | 24.000000 | 2549.375000 | 705.000000 | 63973.522230 | 0.200000 | 0.200000 | 361.763742 | 1.000000 | 8.000000 | 13.000000 | 9.000000 | 20.000000 |
Profiling provides a generalized, standardized view of a dataset. The module renders a report that can be scrolled through to get a first overview of the dataset.
Note: some features are categorical values, or hide categorical interpretations behind integer codes!
This view serves as a first look and requires a deeper analysis in what follows.
%%time
pandas_profiling.ProfileReport(df)
Wall time: 16.3 s
The pairplot shows the distribution of each attribute on its own and against every other attribute. This is useful because duplications and dependencies between features become visible here. These must of course be confirmed by a correlation analysis and, where necessary (which is usually the case), the affected features removed. Anything conspicuous at first glance should be investigated further.
Columns of type object cannot be plotted, so all object-typed columns have to be removed first.
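The object columns dropped below are listed by hand; as a sketch (not from the original notebook), `select_dtypes` can pick them out automatically. A toy frame with the same dtype mix as the real df (numeric, object, bool) illustrates this:

```python
import pandas as pd

# Toy frame standing in for df (hypothetical values, same dtype mix)
toy_df = pd.DataFrame({
    "ProductRelated": [1, 2, 10],
    "BounceRates": [0.2, 0.0, 0.02],
    "Month": ["Feb", "Feb", "Feb"],            # object dtype
    "VisitorType": ["Returning_Visitor"] * 3,  # object dtype
    "Weekend": [False, True, False],           # bool dtype
})

# Keep only the columns seaborn can pairplot: drop object- and bool-typed ones
numeric_df = toy_df.select_dtypes(exclude=["object", "bool"])
print(list(numeric_df.columns))  # ['ProductRelated', 'BounceRates']
```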
sns.set(style="ticks", color_codes=True)
testdf = df.drop(columns = ['VisitorType','Weekend','Revenue','Month'])
#testdf = df.drop(columns = ['VisitorType','Weekend','Revenue'])
#testdf = df.drop(columns = ['Weekend','Revenue','Month'])
#g = sns.pairplot(testdf)
#fig = g.get_figure()
#fig.savefig("pairplot.png")
%%time
g = sns.pairplot(testdf)
Wall time: 10.4 s
%%time
df.isnull().sum().sum()
Wall time: 2.99 ms
0
%%time
df['Revenue'].value_counts()
Wall time: 997 µs
0    10422
1     1908
Name: Revenue, dtype: int64
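`value_counts(normalize=True)` turns the raw counts above into class shares directly, which makes the class imbalance explicit. A sketch with a toy Series rebuilt from the counts shown above:

```python
import pandas as pd

# Toy Revenue column rebuilt from the counts above (10422 False / 1908 True)
revenue = pd.Series([False] * 10422 + [True] * 1908)

# normalize=True returns relative frequencies instead of raw counts
shares = revenue.value_counts(normalize=True).round(4)
print(shares)  # False 0.8453, True 0.1547
```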
# distribution of revenue yes/no and of weekend purchases yes/no
plt.rcParams['figure.figsize'] = (12, 6)
plt.subplot(1, 2, 1)
sns.countplot(df['Revenue'], palette = 'winter')
plt.title('Purchased?', fontsize = 30)
plt.xlabel('Revenue or not', fontsize = 15)
plt.ylabel('Count', fontsize = 15)
# checking the Distribution of customers on Weekend
plt.subplot(1, 2, 2)
sns.countplot(df['Weekend'], palette = 'deep')
plt.title('Weekend?', fontsize = 30)
plt.xlabel('Weekend or not', fontsize = 15)
plt.ylabel('Count', fontsize = 15)
plt.show()
| No. | Null hypothesis | Alternative hypothesis | Decision / method |
|---|---|---|---|
| 1 | Revenue depends on the weekend | Revenue does not depend on the weekend | H1 accepted / correlation coefficient |
import numpy
print("Correlation coefficient between Weekend and Revenue: "+str(numpy.corrcoef(df['Weekend'], df['Revenue'])[0, 1]))
print("Total length of the DataFrame: "+str(len(df)))
print("################## Weekend ######################")
print(df['Weekend'].value_counts())
falsevalue_pct = round(df['Weekend'].value_counts()[False]/len(df), 4)
print("Share False: "+str(falsevalue_pct))
truevalue_pct = round(1-falsevalue_pct, 4)
print("Share True: "+str(truevalue_pct))
print("################## Revenue ######################")
print(df['Revenue'].value_counts())
falsevalue_pct = round(df['Revenue'].value_counts()[False]/len(df), 4)
print("Share False: "+str(falsevalue_pct))
truevalue_pct = round(1-falsevalue_pct, 4)
print("Share True: "+str(truevalue_pct))
Correlation coefficient between Weekend and Revenue: 0.02929536797199438
Total length of the DataFrame: 12330
################## Weekend ######################
False    9462
True     2868
Name: Weekend, dtype: int64
Share False: 0.7674
Share True: 0.2326
################## Revenue ######################
False    10422
True      1908
Name: Revenue, dtype: int64
Share False: 0.8453
Share True: 0.1547
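Pearson's r between two Boolean columns is the phi coefficient; as an additional, hedged check one could run a chi-square test of independence on the contingency table. A sketch, assuming scipy is available, using the Weekend×Revenue counts from the crosstab computed later in the notebook:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table rebuilt from the Weekend x Revenue crosstab of the notebook
# rows: Weekend False/True, columns: Revenue False/True
table = np.array([[8053, 1409],
                  [2369,  499]])

# chi2_contingency tests independence of the two binary variables
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.3f}, p={p:.5f}, dof={dof}")
```

A small p-value here would indicate a statistically detectable (if weak, given phi = 0.029) association despite the tiny correlation coefficient.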
# pie charts
plt.rcParams['figure.figsize'] = (18, 7)
size = [10551, 1694, 85]
colors = ["#9b59b6", "#3498db", "#95a5a6"]
labels = "Returning Visitor", "New_Visitor", "Others"
explode = [0, 0.1, 0.1]
plt.subplot(1, 2, 1)
plt.pie(size, colors = colors, labels = labels, explode = explode, shadow = True, autopct = '%.2f%%')
plt.title('Visitor types', fontsize = 30)
plt.axis('off')
plt.legend()
# plotting a pie chart for browsers
size = [7961, 2462, 736, 467,174, 163, 300]
colors = ['orange', 'yellow', 'pink', 'crimson', 'lightgreen', 'cyan', 'blue']
labels = "2", "1","4","5","6","10","others"
explode = [0, 0.01,0.01,0.1,0.1,0.1, 0.1]
plt.subplot(1, 2, 2)
plt.pie(size, colors = colors, labels = labels, explode = explode,shadow = True, autopct = '%.2f%%', startangle = 90)
plt.title('Browser types', fontsize = 30)
plt.axis('off')
plt.legend()
plt.show()
df.replace({'VisitorType': 'New_Visitor'},value=1, inplace=True)
df.replace({'VisitorType': 'Other'},value=2, inplace=True)
df.replace({'VisitorType': 'Returning_Visitor'},value=3, inplace=True)
VSlist = df['VisitorType'].value_counts()
VSlist.sort_values(ascending=False, inplace=True, kind='quicksort')
VSlist
3    10551
1     1694
2       85
Name: VisitorType, dtype: int64
#print(df['OperatingSystems'].value_counts())
#print(df['Browser'].value_counts())
OSlist = df['OperatingSystems'].value_counts()
OSlist.sort_values(ascending=False, inplace=True, kind='quicksort')
OSlist
2    6601
1    2585
3    2555
4     478
8      79
6      19
7       7
5       6
Name: OperatingSystems, dtype: int64
df.replace({'OperatingSystems': 5},value=4, inplace=True)
df.replace({'OperatingSystems': 7},value=4, inplace=True)
df.replace({'OperatingSystems': 6},value=4, inplace=True)
df.replace({'OperatingSystems': 8},value=4, inplace=True)
df.replace({'OperatingSystems': 4},value=3, inplace=True)
OSlist = df['OperatingSystems'].value_counts()
OSlist.sort_values(ascending=False, inplace=True, kind='quicksort')
OSlist
2    6601
3    3144
1    2585
Name: OperatingSystems, dtype: int64
BWlist = df['Browser'].value_counts()
BWlist.sort_values(ascending=False, inplace=True, kind='quicksort')
BWlist
2     7961
1     2462
4      736
5      467
6      174
10     163
8      135
3      105
13      61
7       49
12      10
11       6
9        1
Name: Browser, dtype: int64
df.replace({'Browser': 9},value=6, inplace=True)
df.replace({'Browser': 11},value=6, inplace=True)
df.replace({'Browser': 12},value=6, inplace=True)
df.replace({'Browser': 7},value=6, inplace=True)
df.replace({'Browser': 13},value=6, inplace=True)
df.replace({'Browser': 3},value=6, inplace=True)
df.replace({'Browser': 8},value=6, inplace=True)
df.replace({'Browser': 10},value=6, inplace=True)
df.replace({'Browser': 4},value=3, inplace=True)
df.replace({'Browser': 5},value=4, inplace=True)
df.replace({'Browser': 6},value=5, inplace=True)
BWlist = df['Browser'].value_counts()
BWlist.sort_values(ascending=False, inplace=True, kind='quicksort')
BWlist
2    7961
1    2462
3     736
5     704
4     467
Name: Browser, dtype: int64
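The chained replace calls above are order-dependent: 4 -> 3 has to run before 5 -> 4, which has to run before 6 -> 5, or values would cascade through several remappings. As a sketch (not from the original notebook), the same regrouping can be expressed as one simultaneous mapping, since a dict passed to `replace` is applied to the original values only:

```python
import pandas as pd

# Toy Browser column covering every original code (1-13)
browser = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13])

# One simultaneous mapping equivalent to the chained replaces above:
# long-tail codes {3,6,7,8,9,10,11,12,13} -> 5, then 4 -> 3 and 5 -> 4
mapping = {3: 5, 6: 5, 7: 5, 8: 5, 9: 5, 10: 5, 11: 5, 12: 5, 13: 5, 4: 3, 5: 4}
regrouped = browser.replace(mapping)
print(regrouped.tolist())  # [1, 2, 5, 3, 4, 5, 5, 5, 5, 5, 5, 5, 5]
```

Because the dict is applied in one pass, the result does not depend on any call ordering.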
OSlist
2    6601
3    3144
1    2585
Name: OperatingSystems, dtype: int64
# pie charts
plt.rcParams['figure.figsize'] = (18, 7)
size = OSlist
colors = ["#9b59b6", "#3498db", "#95a5a6"]
labels = "Windows", "Linux & Other", "MacOS"
explode = [0, 0.1, 0.1]
plt.subplot(1, 2, 1)
plt.pie(size, colors = colors, labels = labels, explode = explode, shadow = True, autopct = '%.2f%%')
plt.title('Operating systems', fontsize = 30)
plt.axis('off')
plt.legend()
# plotting a pie chart for browsers
size = BWlist
colors = ['lightblue', '#3498db', 'orange', '#95a5a6', 'lightgreen']
labels = "Chrome", "Firefox","IE & Edge","Safari","Opera"
explode = [0, 0.1,0.1,0.1,0.1]
plt.subplot(1, 2, 2)
plt.pie(size, colors = colors, labels = labels, explode = explode,shadow = True, autopct = '%.2f%%', startangle = 90)
plt.title('Browser types', fontsize = 30)
plt.axis('off')
plt.legend()
plt.show()
# visualizing the distribution of the traffic types
plt.rcParams['figure.figsize'] = (18, 7)
plt.subplot(1, 2, 1)
plt.hist(df['TrafficType'], color = 'lightgreen')
plt.title('Distribution of the traffic types', fontsize = 20)
plt.xlabel('TrafficType codes', fontsize = 15)
plt.ylabel('Count', fontsize = 15)
# visualizing the distribution of customers across the regions
plt.subplot(1, 2, 2)
plt.hist(df['Region'], color = 'lightblue')
plt.title('Distribution of users across the regions', fontsize = 20)
plt.xlabel('Region codes', fontsize = 15)
plt.ylabel('Count', fontsize = 15)
plt.show()
# informational duration vs revenue
plt.rcParams['figure.figsize'] = (18, 15)
plt.subplot(2, 2, 1)
sns.boxenplot(df['Revenue'], df['Informational_Duration'], palette = 'rainbow')
plt.title('Info. duration vs Revenue', fontsize = 30)
plt.xlabel('Revenue', fontsize = 15)
plt.ylabel('Info. duration', fontsize = 15)
# administrative duration vs revenue
plt.subplot(2, 2, 2)
sns.boxenplot(df['Revenue'], df['Administrative_Duration'], palette = 'pastel')
plt.title('Admn. duration vs Revenue', fontsize = 30)
plt.xlabel('Revenue', fontsize = 15)
plt.ylabel('Admn. duration', fontsize = 15)
# product related duration vs revenue
plt.subplot(2, 2, 3)
sns.boxenplot(df['Revenue'], df['ProductRelated_Duration'], palette = 'dark')
plt.title('Product Related duration vs Revenue', fontsize = 30)
plt.xlabel('Revenue', fontsize = 15)
plt.ylabel('Product Related duration', fontsize = 15)
# exit rate vs revenue
plt.subplot(2, 2, 4)
sns.boxenplot(df['Revenue'], df['ExitRates'], palette = 'spring')
plt.title('ExitRates vs Revenue', fontsize = 30)
plt.xlabel('Revenue', fontsize = 15)
plt.ylabel('ExitRates', fontsize = 15)
plt.show()
| No. | Null hypothesis | Alternative hypothesis | Decision / method |
|---|---|---|---|
| 2 | The time spent in the informational sections has no effect on or relationship with revenue | The time spent in the informational sections has an effect on or relationship with revenue | H0 cannot be rejected, correlation coefficient |
| 3 | The time spent in the administrative section has no effect on or relationship with revenue | The time spent in the administrative section has an effect on or relationship with revenue | H0 cannot be rejected, correlation coefficient |
| 4 | The time spent in the product sections has no effect on or relationship with revenue | The time spent in the product sections has an effect on or relationship with revenue | H0 cannot be rejected, correlation coefficient |
| 5 | Exiting the site from a page has no effect on or relationship with revenue | Exiting the site from a page has an effect on or relationship with revenue | H0 cannot be rejected, correlation coefficient |
print("Correlation coefficient between ExitRates and Revenue: "+str(numpy.corrcoef(df['ExitRates'], df['Revenue'])[0, 1]))
print("Correlation coefficient between ProductRelated_Duration and Revenue: "+str(numpy.corrcoef(df['ProductRelated_Duration'], df['Revenue'])[0, 1]))
print("Correlation coefficient between Administrative_Duration and Revenue: "+str(numpy.corrcoef(df['Administrative_Duration'], df['Revenue'])[0, 1]))
print("Correlation coefficient between Informational_Duration and Revenue: "+str(numpy.corrcoef(df['Informational_Duration'], df['Revenue'])[0, 1]))
Correlation coefficient between ExitRates and Revenue: -0.20707108205527205
Correlation coefficient between ProductRelated_Duration and Revenue: 0.15237261055701043
Correlation coefficient between Administrative_Duration and Revenue: 0.09358671905704201
Correlation coefficient between Informational_Duration and Revenue: 0.0703445023459834
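The four prints above can be collapsed into one call with `DataFrame.corrwith`. A sketch with toy numbers, not the real data (the real df would be used in the notebook):

```python
import pandas as pd

# Toy frame standing in for df (hypothetical values)
toy_df = pd.DataFrame({
    "ExitRates":  [0.20, 0.10, 0.02, 0.05],
    "PageValues": [0.0, 0.0, 12.0, 30.0],
    "Revenue":    [0, 0, 1, 1],
})

# Correlate several feature columns with the target in a single call
res = toy_df[["ExitRates", "PageValues"]].corrwith(toy_df["Revenue"])
print(res.round(3))
```

On the real data this reproduces the four values printed above in one Series.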
# page values vs revenue
plt.rcParams['figure.figsize'] = (18, 7)
plt.subplot(1, 2, 1)
sns.stripplot(df['Revenue'], df['PageValues'], palette = 'autumn')
plt.title('PageValues vs Revenue', fontsize = 30)
plt.xlabel('Revenue', fontsize = 15)
plt.ylabel('PageValues', fontsize = 15)
# bounce rates vs revenue
plt.subplot(1, 2, 2)
sns.stripplot(df['Revenue'], df['BounceRates'], palette = 'magma')
plt.title('Bounce Rates vs Revenue', fontsize = 30)
plt.xlabel('Revenue', fontsize = 15)
plt.ylabel('Bounce Rates', fontsize = 15)
plt.show()
| No. | Null hypothesis | Alternative hypothesis | Decision / method |
|---|---|---|---|
| 6 | The PageValue has no effect on or relationship with revenue | The PageValue has an effect on or relationship with revenue | H0 can be rejected, correlation coefficient |
| 7 | The BounceRate has no effect on or relationship with revenue | The BounceRate has an effect on or relationship with revenue | H0 cannot be rejected, correlation coefficient |
print("Correlation coefficient between PageValues and Revenue: "+str(numpy.corrcoef(df['PageValues'], df['Revenue'])[0, 1]))
print("Correlation coefficient between BounceRates and Revenue: "+str(numpy.corrcoef(df['BounceRates'], df['Revenue'])[0, 1]))
Correlation coefficient between PageValues and Revenue: 0.49256929525120763
Correlation coefficient between BounceRates and Revenue: -0.15067291192605398
# weekend vs Revenue
df_data = pd.crosstab(df['Weekend'], df['Revenue'])
df_data.div(df_data.sum(1).astype(float), axis = 0).plot(kind = 'bar', stacked = True, figsize = (7, 4), color = ['orange', 'crimson'])
plt.title('Weekend vs Revenue', fontsize = 20)
plt.show()
df_data
| Revenue | False | True |
|---|---|---|
| Weekend | ||
| False | 8053 | 1409 |
| True | 2369 | 499 |
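The manual `.div(df_data.sum(1), ...)` normalization used above is also built into `pd.crosstab` itself via `normalize='index'`. A sketch with toy data (not the real columns):

```python
import pandas as pd

# Toy Weekend/Revenue columns (hypothetical values)
weekend = pd.Series([False, False, True, True, False, True])
revenue = pd.Series([False, True, False, False, True, True])

# normalize='index' gives row-wise shares directly,
# i.e. what .div(df_data.sum(1), axis=0) computed above
tab = pd.crosstab(weekend, revenue, normalize="index")
print(tab.round(3))
```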
# Traffic Type vs Revenue
df_data = pd.crosstab(df['TrafficType'], df['Revenue'])
df_data.div(df_data.sum(1).astype(float), axis = 0).plot(kind = 'bar', stacked = True, figsize = (15, 5), color = ['lightpink', 'yellow'])
plt.title('Traffic Type vs Revenue', fontsize = 30)
plt.show()
df_data
| Revenue | False | True |
|---|---|---|
| TrafficType | ||
| 1 | 2189 | 262 |
| 2 | 3066 | 847 |
| 3 | 1872 | 180 |
| 4 | 904 | 165 |
| 5 | 204 | 56 |
| 6 | 391 | 53 |
| 7 | 28 | 12 |
| 8 | 248 | 95 |
| 9 | 38 | 4 |
| 10 | 360 | 90 |
| 11 | 200 | 47 |
| 12 | 1 | 0 |
| 13 | 695 | 43 |
| 14 | 11 | 2 |
| 15 | 38 | 0 |
| 16 | 2 | 1 |
| 17 | 1 | 0 |
| 18 | 10 | 0 |
| 19 | 16 | 1 |
| 20 | 148 | 50 |
| No. | Null hypothesis | Alternative hypothesis | Decision / method |
|---|---|---|---|
| 8 | The TrafficType has no effect on or relationship with revenue | The TrafficType has an effect on or relationship with revenue | H0 cannot be rejected, correlation coefficient |
print("Correlation coefficient between TrafficType and Revenue: "+str(numpy.corrcoef(df['TrafficType'], df['Revenue'])[0, 1]))
import scipy.stats
scipy.stats.spearmanr(df['TrafficType'],df['Revenue'])
Correlation coefficient between TrafficType and Revenue: -0.005112970502755556
SpearmanrResult(correlation=-0.0011891693746148063, pvalue=0.8949584583439678)
# visitor type vs revenue
df_data = pd.crosstab(df['VisitorType'], df['Revenue'])
df_data.div(df_data.sum(1).astype(float), axis = 0).plot(kind = 'bar', stacked = True, figsize = (15, 5), color = ['lightgreen', 'green'])
plt.title('Visitor Type vs Revenue', fontsize = 30)
plt.show()
| No. | Null hypothesis | Alternative hypothesis | Decision / method |
|---|---|---|---|
| 9 | The VisitorType has no effect on or relationship with revenue | The VisitorType has an effect on or relationship with revenue | H0 cannot be rejected, correlation coefficient |
print("Correlation coefficient between VisitorType and Revenue: "+str(numpy.corrcoef(df['VisitorType'], df['Revenue'])[0, 1]))
import scipy.stats
scipy.stats.spearmanr(df['VisitorType'],df['Revenue'])
Correlation coefficient between VisitorType and Revenue: -0.10472572201866671
SpearmanrResult(correlation=-0.10423887899368257, pvalue=3.878131128342029e-31)
# region vs Revenue
df_data = pd.crosstab(df['Region'], df['Revenue'])
df_data.div(df_data.sum(1).astype(float), axis = 0).plot(kind = 'bar', stacked = True, figsize = (15, 5), color = ['blue', 'lightblue'])
plt.title('Region vs Revenue', fontsize = 30)
plt.show()
| No. | Null hypothesis | Alternative hypothesis | Decision / method |
|---|---|---|---|
| 10 | The Region has no effect on or relationship with revenue | The Region has an effect on or relationship with revenue | H0 cannot be rejected, correlation coefficient |
print("Correlation coefficient between Region and Revenue: "+str(numpy.corrcoef(df['Region'], df['Revenue'])[0, 1]))
import scipy.stats
scipy.stats.spearmanr(df['Region'],df['Revenue'])
Correlation coefficient between Region and Revenue: -0.011595067777800517
SpearmanrResult(correlation=-0.014792282525562174, pvalue=0.10049341182361644)
# Month (X) vs PageValues (Y), split by Revenue
plt.rcParams['figure.figsize'] = (18, 15)
plt.subplot(2, 2, 1)
sns.boxplot(x = df['Month'], y = df['PageValues'], hue = df['Revenue'], palette = 'inferno')
plt.title('Month (X) vs PageValues (Y), split by Revenue', fontsize = 15)
# Month (X) vs ExitRates (Y), split by Revenue
plt.subplot(2, 2, 2)
sns.boxplot(x = df['Month'], y = df['ExitRates'], hue = df['Revenue'], palette = 'Reds')
plt.title('Month (X) vs ExitRates (Y), split by Revenue', fontsize = 15)
# Month (X) vs BounceRates (Y), split by Revenue
plt.subplot(2, 2, 3)
sns.boxplot(x = df['Month'], y = df['BounceRates'], hue = df['Revenue'], palette = 'Oranges')
plt.title('Month (X) vs BounceRates (Y), split by Revenue', fontsize = 15)
# VisitorType (X) vs BounceRates (Y), split by Revenue
plt.subplot(2, 2, 4)
sns.boxplot(x = df['VisitorType'], y = df['BounceRates'], hue = df['Revenue'], palette = 'Purples')
plt.title('VisitorType (X) vs BounceRates (Y), split by Revenue', fontsize = 15)
plt.show()
plt.show()
# visitor type vs exit rates w.r.t revenue
plt.rcParams['figure.figsize'] = (18, 15)
plt.subplot(2, 2, 1)
sns.violinplot(x = df['VisitorType'], y = df['ExitRates'], hue = df['Revenue'])#, palette = 'rainbow')
plt.title('Visitors vs ExitRates wrt Rev.', fontsize = 30)
# visitor type vs exit rates w.r.t revenue
plt.subplot(2, 2, 2)
sns.violinplot(x = df['VisitorType'], y = df['PageValues'], hue = df['Revenue'])#, palette = 'gnuplot')
plt.title('Visitors vs PageValues wrt Rev.', fontsize = 30)
# region vs pagevalues w.r.t. revenue
plt.subplot(2, 2, 3)
sns.violinplot(x = df['Region'], y = df['PageValues'], hue = df['Revenue'])#, palette = 'Greens')
plt.title('Region vs PageValues wrt Rev.', fontsize = 30)
#region vs exit rates w.r.t. revenue
plt.subplot(2, 2, 4)
sns.violinplot(x = df['Region'], y = df['ExitRates'], hue = df['Revenue'])#, palette = 'spring')
plt.title('Region vs Exit Rates w.r.t. Revenue', fontsize = 30)
plt.show()
Preparing the dataset for the models used; afterwards only numeric values should remain in the dataset.
The months are each encoded as a feature of their own.
# one-hot encoding for all categorical columns
dfnew = pd.DataFrame(df)
df_data1 = pd.get_dummies(df)
df_data1.columns
Index(['Administrative', 'Administrative_Duration', 'Informational',
'Informational_Duration', 'ProductRelated', 'ProductRelated_Duration',
'BounceRates', 'ExitRates', 'PageValues', 'SpecialDay',
'OperatingSystems', 'Browser', 'Region', 'TrafficType', 'VisitorType',
'Weekend', 'Revenue', 'Month_Aug', 'Month_Dec', 'Month_Feb',
'Month_Jul', 'Month_June', 'Month_Mar', 'Month_May', 'Month_Nov',
'Month_Oct', 'Month_Sep'],
dtype='object')
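One dummy column per month means the Month dummies always sum to 1, which is perfectly collinear; for the linear models used later it can help to drop one reference level per encoded column. This is an optional variant, not what the notebook does:

```python
import pandas as pd

# Toy frame with only the Month column (hypothetical values)
months_df = pd.DataFrame({"Month": ["Feb", "Mar", "Feb", "Nov"]})

# drop_first=True drops one reference level per encoded column,
# avoiding perfect collinearity among the dummies
cols = list(pd.get_dummies(months_df, drop_first=True).columns)
print(cols)  # ['Month_Mar', 'Month_Nov']
```

Tree-based models are insensitive to this redundancy, so keeping all dummies, as above, is also fine.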
# label encoding for the target (Revenue)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
dfnew['Revenue'] = le.fit_transform(dfnew['Revenue'])
dfnew['Revenue'].value_counts()
0    10422
1     1908
Name: Revenue, dtype: int64
# dependent and independent variables
x = df_data1
# drop the target (Revenue)
x = x.drop(['Revenue'], axis = 1)
y = dfnew['Revenue']
# check dimensions
print("Shape of x:", x.shape)
print("Shape of y:", y.shape)
Shape of x: (12330, 26)
Shape of y: (12330,)
# check the one-hot encoding ... looks okay
x.head(10)
| Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay | ... | Month_Aug | Month_Dec | Month_Feb | Month_Jul | Month_June | Month_Mar | Month_May | Month_Nov | Month_Oct | Month_Sep | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 0 | 0.0 | 1 | 0.000000 | 0.200000 | 0.200000 | 0.0 | 0.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0.0 | 0 | 0.0 | 2 | 64.000000 | 0.000000 | 0.100000 | 0.0 | 0.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0.0 | 0 | 0.0 | 1 | 0.000000 | 0.200000 | 0.200000 | 0.0 | 0.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0.0 | 0 | 0.0 | 2 | 2.666667 | 0.050000 | 0.140000 | 0.0 | 0.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0.0 | 0 | 0.0 | 10 | 627.500000 | 0.020000 | 0.050000 | 0.0 | 0.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0.0 | 0 | 0.0 | 19 | 154.216667 | 0.015789 | 0.024561 | 0.0 | 0.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0.0 | 0 | 0.0 | 1 | 0.000000 | 0.200000 | 0.200000 | 0.0 | 0.4 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 1 | 0.0 | 0 | 0.0 | 0 | 0.000000 | 0.200000 | 0.200000 | 0.0 | 0.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0.0 | 0 | 0.0 | 2 | 37.000000 | 0.000000 | 0.100000 | 0.0 | 0.8 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0.0 | 0 | 0.0 | 3 | 738.000000 | 0.000000 | 0.022222 | 0.0 | 0.4 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
10 rows × 26 columns
Adjusting x and y is done in the preceding part.
# splitting the dataset
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 123)
# check dimensions
print("Shape of x_train :", x_train.shape)
print("Shape of y_train :", y_train.shape)
print("Shape of x_test :", x_test.shape)
print("Shape of y_test :", y_test.shape)
Shape of x_train : (8631, 26)
Shape of y_train : (8631,)
Shape of x_test : (3699, 26)
Shape of y_test : (3699,)
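With only about 15% positive Revenue labels, a plain random split can shift the class ratio between train and test. `train_test_split` supports `stratify` to preserve it; a sketch with a toy target of the same rough ratio (the notebook itself splits without stratification):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy target with roughly the notebook's class ratio (~15% positives)
y = np.array([0] * 85 + [1] * 15)
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the positive share (approximately) equal in both halves
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y)

print(round(y_tr.mean(), 2), round(y_te.mean(), 2))
```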
(https://scikit-learn.org/stable/modules/ensemble.html)
GradientBoostingClassifier
The advantages of GBRT (per the scikit-learn documentation linked above) are: natural handling of data of mixed type, strong predictive power, and robustness to outliers in the output space via robust loss functions.
The disadvantages of GBRT are: scalability, since due to the sequential nature of boosting it can hardly be parallelized.
x Voting Classifier (excluded from the model list below)
# MODELLING
# Ensemble
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import VotingClassifier
# Naive Bayes
from sklearn.naive_bayes import GaussianNB
# linear
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression
from sklearn.linear_model import RidgeClassifier
from sklearn.linear_model import LogisticRegression
# GaussianProcess
# https://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.GaussianProcessClassifier.html
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import auc
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint as sp_randint
model_RF = RandomForestClassifier()
model_RF.fit(x_train, y_train)
my_models = [
RandomForestClassifier(),
BaggingClassifier(),
ExtraTreesClassifier(),
AdaBoostClassifier(),
GradientBoostingClassifier(),
#VotingClassifier(estimators=1000),
GaussianNB(),
#RidgeClassifier(),
LogisticRegression()
]
my_models_name = [
'RandomForestClassifier (Ensamble)',
'BaggingClassifier (Ensamble)',
'ExtraTreesClassifier (Ensamble)',
'AdaBoostClassifier (Ensamble)',
'GradientBoostingClassifier (Ensamble)',
'GaussianNB (NaiveBayes)',
#'RidgeClassifier (linear)',
'LogisticRegression (linear)'
]
y_pred_array = []
for i in range(len(my_models)):
my_models[i].fit(x_train, y_train)
y_pred_array.append(my_models[i].predict(x_test))
precision_array = []
accuracy_array = []
accuracyBalanced_array = []
recall_array = []
TPR_array=[]
TNR_array = []
Fmeasure_array = []
for i in range(len(my_models)):
    # predict once and reuse the result for every score
    y_pred = my_models[i].predict(x_test)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    precision_array.append(precision_score(y_test, y_pred))
    accuracy_array.append(accuracy_score(y_test, y_pred))
    recall_array.append(recall_score(y_test, y_pred))
    TPR_array.append(tp / (tp + fn))
    TNR_array.append(tn / (tn + fp))  # note: 1 - recall is the FNR, not the TNR
    Fmeasure_array.append(2*((precision_array[i]*recall_array[i])/(precision_array[i]+recall_array[i])))
    accuracyBalanced_array.append((TPR_array[i]+TNR_array[i])/2)
d = {'Model': my_models_name,
'Precision': precision_array,
'Accuracy': accuracy_array,
'Balanced Accuracy': accuracyBalanced_array,
'Recall': recall_array,
'TPR': TPR_array,
'TNR': TNR_array,
'Fmeasure': Fmeasure_array}
dataframe_with_scores = pd.DataFrame(data=d)
dataframe_with_scores
| Model | Precision | Accuracy | Balanced Accuracy | Recall | TPR | TNR | Fmeasure | |
|---|---|---|---|---|---|---|---|---|
| 0 | RandomForestClassifier (Ensamble) | 0.743529 | 0.891322 | 0.5 | 0.518883 | 0.518883 | 0.481117 | 0.611219 |
| 1 | BaggingClassifier (Ensamble) | 0.712617 | 0.884563 | 0.5 | 0.500821 | 0.500821 | 0.499179 | 0.588235 |
| 2 | ExtraTreesClassifier (Ensamble) | 0.731928 | 0.876994 | 0.5 | 0.399015 | 0.399015 | 0.600985 | 0.516472 |
| 3 | AdaBoostClassifier (Ensamble) | 0.660305 | 0.880779 | 0.5 | 0.568144 | 0.568144 | 0.431856 | 0.610768 |
| 4 | GradientBoostingClassifier (Ensamble) | 0.704918 | 0.889430 | 0.5 | 0.564860 | 0.564860 | 0.435140 | 0.627165 |
| 5 | GaussianNB (NaiveBayes) | 0.441970 | 0.808597 | 0.5 | 0.619048 | 0.619048 | 0.380952 | 0.515732 |
| 6 | LogisticRegression (linear) | 0.736156 | 0.874561 | 0.5 | 0.371100 | 0.371100 | 0.628900 | 0.493450 |
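The Balanced Accuracy column above is a constant 0.5 because the TNR was derived as 1 − recall, which is actually the false-negative rate; the true-negative rate has to come from the confusion matrix. scikit-learn's `balanced_accuracy_score` computes the corrected value directly. A small cross-check on toy labels:

```python
from sklearn.metrics import balanced_accuracy_score, confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)   # recall of the positive class
tnr = tn / (tn + fp)   # NOT 1 - recall (that would be the FNR)
print((tpr + tnr) / 2)
print(balanced_accuracy_score(y_true, y_pred))  # same value
```

For the binary case, balanced accuracy is exactly the mean of TPR and TNR, so the two printed numbers agree.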
rf_roc_auc_array = []
fpr_array = []
tpr_array = []
thresholds_array = []
plt.figure(figsize=(12,6))
for i in range(len(my_models)):
    y_score = my_models[i].predict_proba(x_test)[:, 1]
    # AUC should be computed from scores/probabilities; hard predict() labels understate it
    rf_roc_auc_array.append(roc_auc_score(y_test, y_score))
    fpr, tpr, thresholds = roc_curve(y_test, y_score)
fpr_array.append(fpr)
tpr_array.append(tpr)
thresholds_array.append(thresholds)
plt.plot(fpr, tpr, label=my_models_name[i]+' (area = %0.2f)' % rf_roc_auc_array[i])
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic - ROC-Graph')
plt.legend(loc="lower right")
plt.savefig('RF_ROC')
plt.show()
https://scikit-learn.org/stable/modules/cross_validation.html
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_val_score
#xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size = 0.3, random_state = 123)
# cross-validation runs on the full dataset
X = x
for i in range(len(my_models)):
print("CV: "+str(cross_val_score(my_models[i], X, y, cv=6))+" <- "+str(my_models_name[i]))
CV: [0.90997567 0.91192214 0.87396594 0.80924574 0.87737226 0.87591241] <- RandomForestClassifier (Ensamble)
CV: [0.9163017 0.91435523 0.87250608 0.75231144 0.87347932 0.86520681] <- BaggingClassifier (Ensamble)
CV: [0.86861314 0.89440389 0.85158151 0.72992701 0.88029197 0.85985401] <- ExtraTreesClassifier (Ensamble)
CV: [0.85693431 0.88321168 0.88126521 0.80291971 0.87250608 0.87007299] <- AdaBoostClassifier (Ensamble)
CV: [0.9026764 0.92311436 0.88515815 0.79367397 0.88856448 0.88856448] <- GradientBoostingClassifier (Ensamble)
CV: [0.87250608 0.87396594 0.76107056 0.18832117 0.75279805 0.75815085] <- GaussianNB (NaiveBayes)
CV: [0.88272506 0.88467153 0.87931873 0.78588808 0.87737226 0.86909976] <- LogisticRegression (linear)
from sklearn.model_selection import cross_validate
from sklearn.metrics import make_scorer
from sklearn.metrics import confusion_matrix
from sklearn.metrics import recall_score
#scoring = ['precision_macro', 'recall_macro']
scoring = {'prec_macro': 'precision_macro','rec_macro': make_scorer(recall_score, average='macro')}
for i in range(len(my_models)):
cv_results = cross_validate(my_models[i], X, y, cv=10, scoring=scoring)
    cv_results.update({'Modell': my_models_name[i]})
if i == 0:
dfdict = pd.DataFrame.from_dict(cv_results, orient='columns')
else:
dftemp = pd.DataFrame.from_dict(cv_results, orient='columns')
        dfdict = pd.concat([dfdict, dftemp])  # DataFrame.append was removed in pandas 2.0
dfdict.head(7)
| fit_time | score_time | test_prec_macro | test_rec_macro | Modell | |
|---|---|---|---|---|---|
| 0 | 0.098737 | 0.007978 | 0.904009 | 0.608509 | RandomForestClassifier (Ensamble) |
| 1 | 0.096741 | 0.007979 | 0.806267 | 0.722142 | RandomForestClassifier (Ensamble) |
| 2 | 0.095745 | 0.006981 | 0.904048 | 0.750786 | RandomForestClassifier (Ensamble) |
| 3 | 0.092752 | 0.006981 | 0.809415 | 0.701443 | RandomForestClassifier (Ensamble) |
| 4 | 0.095744 | 0.007979 | 0.861753 | 0.735517 | RandomForestClassifier (Ensamble) |
| 5 | 0.090757 | 0.007979 | 0.754726 | 0.731289 | RandomForestClassifier (Ensamble) |
| 6 | 0.092752 | 0.008976 | 0.756864 | 0.687570 | RandomForestClassifier (Ensamble) |
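`DataFrame.append` was deprecated and then removed in pandas 2.0. When accumulating per-model results like the frame above, the portable pattern is to collect the pieces in a list and call `pd.concat` once at the end. A minimal sketch with hypothetical result dicts standing in for `cross_validate()` outputs:

```python
import pandas as pd

# Hypothetical per-model result dicts (stand-ins for cross_validate outputs)
results = [
    {'fit_time': [0.1, 0.2], 'test_prec_macro': [0.90, 0.81], 'Modell': 'A'},
    {'fit_time': [0.3, 0.4], 'test_prec_macro': [0.88, 0.79], 'Modell': 'B'},
]

# Collect one DataFrame per model, then concatenate once
frames = [pd.DataFrame(r) for r in results]
dfdict = pd.concat(frames, ignore_index=True)
print(dfdict.shape)
```

Concatenating once is also faster than appending in a loop, which copies the growing frame on every iteration.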
dfdict.sort_values(by=['fit_time'], ascending=True).head(10)
| fit_time | score_time | test_prec_macro | test_rec_macro | Modell | |
|---|---|---|---|---|---|
| 1 | 0.015957 | 0.004987 | 0.676780 | 0.666021 | GaussianNB (NaiveBayes) |
| 4 | 0.015957 | 0.004987 | 0.671756 | 0.684696 | GaussianNB (NaiveBayes) |
| 5 | 0.015957 | 0.004987 | 0.449269 | 0.425983 | GaussianNB (NaiveBayes) |
| 6 | 0.015957 | 0.004987 | 0.520950 | 0.539749 | GaussianNB (NaiveBayes) |
| 0 | 0.015958 | 0.004987 | 0.673736 | 0.508554 | GaussianNB (NaiveBayes) |
| 3 | 0.015958 | 0.004986 | 0.639710 | 0.628106 | GaussianNB (NaiveBayes) |
| 8 | 0.015958 | 0.003989 | 0.662734 | 0.734433 | GaussianNB (NaiveBayes) |
| 7 | 0.015958 | 0.004986 | 0.678509 | 0.769824 | GaussianNB (NaiveBayes) |
| 2 | 0.015958 | 0.004986 | 0.778285 | 0.709952 | GaussianNB (NaiveBayes) |
| 9 | 0.016955 | 0.003989 | 0.649447 | 0.734620 | GaussianNB (NaiveBayes) |
dfdict.sort_values(by=['test_prec_macro','fit_time'], ascending=False).head(10)
| fit_time | score_time | test_prec_macro | test_rec_macro | Modell | |
|---|---|---|---|---|---|
| 2 | 0.081781 | 0.007978 | 0.929229 | 0.712740 | ExtraTreesClassifier (Ensamble) |
| 2 | 0.741019 | 0.006981 | 0.907313 | 0.811214 | GradientBoostingClassifier (Ensamble) |
| 2 | 0.095745 | 0.006981 | 0.904048 | 0.750786 | RandomForestClassifier (Ensamble) |
| 0 | 0.098737 | 0.007978 | 0.904009 | 0.608509 | RandomForestClassifier (Ensamble) |
| 2 | 0.882640 | 0.073802 | 0.889644 | 0.618020 | AdaBoostClassifier (Ensamble) |
| 2 | 0.408907 | 0.018949 | 0.888366 | 0.783597 | BaggingClassifier (Ensamble) |
| 0 | 0.717083 | 0.005984 | 0.886676 | 0.594941 | GradientBoostingClassifier (Ensamble) |
| 0 | 0.181515 | 0.004986 | 0.886332 | 0.554015 | LogisticRegression (linear) |
| 2 | 0.125664 | 0.004987 | 0.870324 | 0.669635 | LogisticRegression (linear) |
| 4 | 0.095744 | 0.007979 | 0.861753 | 0.735517 | RandomForestClassifier (Ensamble) |
import numpy as np
from sklearn.model_selection import KFold
def best_kfold_fitted_model(model_list,models_name_list,x,y,n_splits):
xnew = x
ynew = y
#KFOLD Config
kf = KFold(n_splits=n_splits)
kf.get_n_splits(xnew)
    # score arrays (could be slimmed down)
precision_array = []
accuracy_array = []
accuracyBalanced_array = []
recall_array = []
TPR_array=[]
TNR_array = []
Fmeasure_array = []
name_array = []
fold_array = []
j=0
    # values returned to the caller
    bestACC = 0.0
    bestF1 = 0.0
    bestPre = 0.0
    bestIndex = 0  # initialised so the returned index is always bound
    bestmodel = None
for train_index, test_index in kf.split(xnew):
print("TRAIN:", train_index, "TEST:", test_index)
kfold_x_train, kfold_x_test = xnew.iloc[train_index], xnew.iloc[test_index]
kfold_y_train, kfold_y_test = ynew.iloc[train_index], ynew.iloc[test_index]
        for i in range(len(model_list)):
            model_list[i].fit(kfold_x_train, kfold_y_train)
            y_pred = model_list[i].predict(kfold_x_test)
            tn, fp, fn, tp = confusion_matrix(kfold_y_test, y_pred).ravel()
            precision_array.append(precision_score(kfold_y_test, y_pred))
            accuracy_array.append(accuracy_score(kfold_y_test, y_pred))
            recall_array.append(recall_score(kfold_y_test, y_pred))
            TPR_array.append(tp / (tp + fn))
            TNR_array.append(tn / (tn + fp))  # 1 - recall is the FNR, not the TNR
            # [-1] picks this fold's scores; index i would re-read fold 0 on later folds
            Fmeasure_array.append(2*((precision_array[-1]*recall_array[-1])/(precision_array[-1]+recall_array[-1])))
            accuracyBalanced_array.append((TPR_array[-1]+TNR_array[-1])/2)
            name_array.append(models_name_list[i])
            fold_array.append(j)
            tempACC = accuracy_array[-1]
            tempPre = precision_array[-1]
            tempF1 = Fmeasure_array[-1]
            print(tempACC)
            if bestACC < tempACC and bestF1 < tempF1 and bestPre < tempPre:
                #if bestACC < tempACC:
                print('improved')
                bestACC = tempACC
                bestF1 = tempF1
                bestPre = tempPre
                bestIndex = i
                bestmodel = model_list[i]
j+=1
d = {'Model': name_array,
'Fold Number': fold_array,
'Precision': precision_array,
'Accuracy': accuracy_array,
'Balanced Accuracy': accuracyBalanced_array,
'Recall': recall_array,
'TPR': TPR_array,
'TNR': TNR_array,
'Fmeasure': Fmeasure_array}
return pd.DataFrame(data=d), [bestIndex,bestmodel,bestACC,bestPre,bestF1]
dataframe_with_scores, bestmodel = best_kfold_fitted_model(my_models,my_models_name,x,y,8)
TRAIN: [ 1542 1543 1544 ... 12327 12328 12329] TEST: [ 0 1 2 ... 1539 1540 1541]
0.9487678339818417
improved
0.9552529182879378
improved
0.9396887159533074
0.9461738002594033
0.9552529182879378
0.9221789883268483
0.9364461738002594
TRAIN: [ 0 1 2 ... 12327 12328 12329] TEST: [1542 1543 1544 ... 3081 3082 3083]
0.9312581063553826
0.9254215304798963
0.9163424124513618
0.914396887159533
0.9319066147859922
0.8929961089494164
0.9111543450064851
TRAIN: [ 0 1 2 ... 12327 12328 12329] TEST: [3084 3085 3086 ... 4622 4623 4624]
0.9370538611291369
0.9409474367293965
0.9377027903958468
0.9234263465282284
0.9467878001297858
0.9072031148604802
0.9318624269954575
TRAIN: [ 0 1 2 ... 12327 12328 12329] TEST: [4625 4626 4627 ... 6163 6164 6165]
0.9065541855937703
0.9013627514600908
0.899415963659961
0.8955223880597015
0.9026606099935107
0.7806619078520441
0.8864373783257625
TRAIN: [ 0 1 2 ... 12327 12328 12329] TEST: [6166 6167 6168 ... 7704 7705 7706]
0.8754055807916937
0.8695652173913043
0.863724853990915
0.845554834523037
0.8747566515249838
0.5197923426346528
0.8604802076573653
TRAIN: [ 0 1 2 ... 12327 12328 12329] TEST: [7707 7708 7709 ... 9245 9246 9247]
0.8384166125892277
0.8377676833225178
0.827384815055159
0.8598312783906554
0.8656716417910447
0.7605451005840363
0.8513951979234263
TRAIN: [ 0 1 2 ... 12327 12328 12329] TEST: [ 9248 9249 9250 ... 10786 10787 10788]
0.8611291369240752
0.8565866320571057
0.844905905256327
0.8663205710577547
0.8754055807916937
0.8053212199870214
0.844905905256327
TRAIN: [ 0 1 2 ... 10786 10787 10788] TEST: [10789 10790 10791 ... 12327 12328 12329]
0.853990914990266
0.8507462686567164
0.8241401687216093
0.8585334198572355
0.8663205710577547
0.7702790395846852
0.8390655418559377
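The fold scores above drop sharply for the later folds (and GaussianNB scored 0.188 on one fold in the earlier `cross_val_score` run), which suggests the rows are ordered; unshuffled `KFold` then produces folds with very different class mixes. A sketch contrasting `KFold` with `StratifiedKFold` on a synthetic ordered, imbalanced target (all data here is illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic ordered target: all positives at the end, mimicking sorted data
y_demo = np.array([0] * 850 + [1] * 150)
X_demo = np.arange(1000).reshape(-1, 1)

plain = KFold(n_splits=5)
strat = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

for name, cv in [('KFold', plain), ('StratifiedKFold', strat)]:
    # fraction of positives in each test fold
    ratios = [y_demo[test].mean() for _, test in cv.split(X_demo, y_demo)]
    print(name, [round(r, 2) for r in ratios])
```

Plain `KFold` yields folds containing no positives at all plus one fold dominated by them, while `StratifiedKFold` keeps every test fold at the overall 15% rate.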
my_models = [
RandomForestClassifier(),
BaggingClassifier(),
ExtraTreesClassifier(),
AdaBoostClassifier(),
GradientBoostingClassifier(),
#VotingClassifier(estimators=1000),
GaussianNB(),
#RidgeClassifier(),
LogisticRegression()
]
my_models_name = [
'RandomForestClassifier (Ensamble)',
'BaggingClassifier (Ensamble)',
'ExtraTreesClassifier (Ensamble)',
'AdaBoostClassifier (Ensamble)',
'GradientBoostingClassifier (Ensamble)',
'GaussianNB (NaiveBayes)',
#'RidgeClassifier (linear)',
'LogisticRegression (linear)'
]
cpu_cores = 8
param_dist = [{"max_depth": [1,3,5,7,9,11, None],
"max_features": sp_randint(1, 11),
"min_samples_split": sp_randint(2, 11),
"bootstrap": [True, False],
"criterion": ["gini", "entropy"],
"class_weight": [None],
"max_leaf_nodes": [None],
"min_impurity_decrease": [0.0, 0.1, 0.2],
"min_impurity_split": [None],
"min_samples_leaf": sp_randint(1, 11),
"n_estimators": [50,100,200,500],
"n_jobs": [cpu_cores],
"min_weight_fraction_leaf": [0.0, 0.1, 0.2],
"random_state": [None, 42, 132],
"warm_start": [True, False]
},
{"bootstrap": [True, False],
"bootstrap_features": [True, False],
"max_features": [1.0],
"max_samples": [1.0],
"n_estimators": [10,50,100,200,500],
"n_jobs": [cpu_cores],
"random_state": [None, 42, 132],
"warm_start": [True, False]
},
{"max_depth": [1,3,5,7,9,11, None],
"max_features": sp_randint(1, 11),
"min_samples_split": sp_randint(2, 11),
"bootstrap": [True, False],
"criterion": ["gini", "entropy"],
"class_weight": [None],
"max_leaf_nodes": [None],
"min_impurity_decrease": [0.0, 0.1, 0.2],
"min_impurity_split": [None],
"min_samples_leaf": sp_randint(1, 11),
"n_estimators": [50,100,200,500],
"n_jobs": [cpu_cores],
"min_weight_fraction_leaf": [0.0, 0.1, 0.2],
"random_state": [None, 42, 132],
"warm_start": [True, False]
},
{"algorithm": ['SAMME.R', 'SAMME'],
"learning_rate": [1.0],
"n_estimators": [10,50,100,200,500],
"random_state": [None, 42, 132],
},
{"criterion": ['friedman_mse', 'mse', 'mae'], #mean absolute error (mae).
"learning_rate": [0.1],
"loss": ['deviance', 'exponential'],
"max_depth": sp_randint(2, 11),
"min_samples_leaf": sp_randint(1, 11),
"min_samples_split": sp_randint(2, 11),
"n_estimators": [10,20,30],
"min_weight_fraction_leaf": [0.0, 0.1, 0.2],
"random_state": [None, 42, 132],
"warm_start": [True, False],
"presort": ['auto'],
"subsample": [1.0],
"tol": [0.0001],
"validation_fraction": [0.1]
},
{"var_smoothing": [1e-09]
},
{"C": [1.0, 0.5, 1.1, 0.9],
"fit_intercept": [True, False],
"intercept_scaling": sp_randint(1, 11),
"l1_ratio": [None],
"max_iter": [100,200,500,1000],
"multi_class": ['ovr','multinomial','auto'],
"n_jobs": [cpu_cores],
"penalty": ['l2'],
"tol": [0.0001],
"random_state": [None, 42, 132],
"warm_start": [True, False],
"solver": ['lbfgs', 'sag', 'saga'],
}]
#param_dist[0]['max_depth']
#param_dist
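`RandomizedSearchCV` draws each candidate from these dicts: list entries are sampled uniformly, and scipy distributions such as `sp_randint` are sampled via `.rvs()`. `ParameterSampler` exposes the same drawing logic, which is handy for sanity-checking a distribution before committing to a long search. A minimal sketch over a pared-down version of the RandomForest entry above:

```python
from scipy.stats import randint as sp_randint
from sklearn.model_selection import ParameterSampler

# Pared-down version of the first param_dist entry
dist = {"max_depth": [1, 3, 5, None],
        "max_features": sp_randint(1, 11),   # drawn via .rvs(), i.e. 1..10
        "criterion": ["gini", "entropy"]}

candidates = list(ParameterSampler(dist, n_iter=5, random_state=42))
for c in candidates:
    print(c)
```

Each printed dict is one candidate configuration, exactly what `RandomizedSearchCV` would fit and cross-validate.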
print(__doc__)
import numpy as np
from time import time
from scipy.stats import randint as sp_randint
from sklearn.model_selection import RandomizedSearchCV
import numpy as np
from sklearn.model_selection import KFold
# Helper that prints the top scores and returns the best parameters
def get_best_params_and_report(results, n_top=3):
for i in range(1, n_top + 1):
candidates = np.flatnonzero(results['rank_test_score'] == i)
for candidate in candidates:
print("Model with rank: {0} Mean validation score: {1:.3f} (std: {2:.3f})".format(i,
results['mean_test_score'][candidate],
results['std_test_score'][candidate]))
#print("Parameters: {0}".format(results['params'][candidate]))
#print("")
if i == 1:
bestparams = results['params'][candidate]
bestscore = {'mean':results['mean_test_score'][candidate],'std': results['std_test_score'][candidate]}
return bestparams, bestscore
# Builds, fits, and returns the best optimized model
def best_opti_fitted_model(model_list,models_name_list,param_dict_list,x,y,xtest,ytest):
    xnew = x
    ynew = y
    # score arrays (could be slimmed down)
precision_array = []
accuracy_array = []
accuracyBalanced_array = []
recall_array = []
TPR_array=[]
TNR_array = []
Fmeasure_array = []
name_array = []
time_to_optimized_hyperParameters = []
    # values returned to the caller
    bestACC = 0.0
    bestF1 = 0.0
    bestPre = 0.0
    bestRec = 0.0
    bestIndex = 0  # initialised so the returned index is always bound
    bestmodel = None
bestscoreList = []
bestscoreList2 = []
    # randomized hyperparameter search
    n_iter_search = 20
    for i in range(len(model_list)):
        # search over the i-th model's parameter distribution
        # (note: iid= was removed in scikit-learn 0.24; drop it on newer versions)
        random_search = RandomizedSearchCV(model_list[i], param_distributions=param_dict_list[i], n_iter=n_iter_search, cv=5, iid=False)
start = time()
random_search.fit(xnew, ynew)
time_to_optimized_hyperParameters.append(time() - start)
print("\n######### "+str(models_name_list[i])+" #########")
print("RandomizedSearchCV took %.2f seconds for %d candidates parameter settings." % (time_to_optimized_hyperParameters[i], n_iter_search))
bestparams, bestscore = get_best_params_and_report(random_search.cv_results_)
bestscoreList.append(bestscore['mean'])
bestscoreList2.append(bestscore['std'])
model_list[i] = random_search.best_estimator_
model_list[i].fit(xnew, ynew)
        # compute the scores on the hold-out set
        y_pred = model_list[i].predict(xtest)
        tn, fp, fn, tp = confusion_matrix(ytest, y_pred).ravel()
        precision_array.append(precision_score(ytest, y_pred))
        accuracy_array.append(accuracy_score(ytest, y_pred))
        recall_array.append(recall_score(ytest, y_pred))
        TPR_array.append(tp / (tp + fn))
        TNR_array.append(tn / (tn + fp))  # 1 - recall is the FNR, not the TNR
        Fmeasure_array.append(2*((precision_array[i]*recall_array[i])/(precision_array[i]+recall_array[i])))
        accuracyBalanced_array.append((TPR_array[i]+TNR_array[i])/2)
        name_array.append(models_name_list[i])
        # selection
        tempACC = accuracy_array[i]
        tempPre = precision_array[i]
        tempF1 = Fmeasure_array[i]
        tempRec = recall_array[i]
print(tempACC)
if bestF1 < tempF1 and bestPre < tempPre and bestRec < tempRec:
#if bestACC < tempACC:
            print('######## improvement -> saving '+str(name_array[i])+' ########')
bestACC = tempACC
bestF1 = tempF1
bestPre = tempPre
bestRec = tempRec
bestIndex = i
bestmodel = model_list[i]
d = {'Model': name_array,
'Precision': precision_array,
'Accuracy': accuracy_array,
'Balanced Accuracy': accuracyBalanced_array,
'Recall': recall_array,
'TPR': TPR_array,
'TNR': TNR_array,
'Fmeasure': Fmeasure_array,
'Opti_MeanValidationScore': bestscoreList,
'Opti_StdScore': bestscoreList2,
'Opti_TimeToHyP': time_to_optimized_hyperParameters}
return pd.DataFrame(data=d), [bestIndex,bestmodel,bestACC,bestPre,bestF1]
Automatically created module for IPython interactive environment
xnew_train, xnew_test, ynew_train, ynew_test = train_test_split(x, y, test_size = 0.2, random_state = 123)
print(xnew_train.shape)
print(xnew_test.shape)
print(ynew_train.shape)
print(ynew_test.shape)
(9864, 26)
(2466, 26)
(9864,)
(2466,)
# runs the full search; this can take a while
dataframe_with_scores, bestmodel = best_opti_fitted_model(my_models,my_models_name,param_dist,xnew_train,ynew_train,xnew_test,ynew_test)
######### RandomForestClassifier (Ensamble) #########
RandomizedSearchCV took 36.18 seconds for 20 candidates parameter settings.
Model with rank: 1 Mean validation score: 0.909 (std: 0.005)
Model with rank: 2 Mean validation score: 0.847 (std: 0.000)
Model with rank: 2 Mean validation score: 0.847 (std: 0.000)
Model with rank: 2 Mean validation score: 0.847 (std: 0.000)
Model with rank: 2 Mean validation score: 0.847 (std: 0.000)
Model with rank: 2 Mean validation score: 0.847 (std: 0.000)
Model with rank: 2 Mean validation score: 0.847 (std: 0.000)
Model with rank: 2 Mean validation score: 0.847 (std: 0.000)
Model with rank: 2 Mean validation score: 0.847 (std: 0.000)
Model with rank: 2 Mean validation score: 0.847 (std: 0.000)
Model with rank: 2 Mean validation score: 0.847 (std: 0.000)
Model with rank: 2 Mean validation score: 0.847 (std: 0.000)
Model with rank: 2 Mean validation score: 0.847 (std: 0.000)
Model with rank: 2 Mean validation score: 0.847 (std: 0.000)
Model with rank: 2 Mean validation score: 0.847 (std: 0.000)
Model with rank: 2 Mean validation score: 0.847 (std: 0.000)
Model with rank: 2 Mean validation score: 0.847 (std: 0.000)
Model with rank: 2 Mean validation score: 0.847 (std: 0.000)
Model with rank: 2 Mean validation score: 0.847 (std: 0.000)
Model with rank: 2 Mean validation score: 0.847 (std: 0.000)
0.8913219789132197
######## improvement -> saving RandomForestClassifier (Ensamble) ########
######### BaggingClassifier (Ensamble) #########
RandomizedSearchCV took 131.10 seconds for 20 candidates parameter settings.
Model with rank: 1 Mean validation score: 0.905 (std: 0.007)
Model with rank: 2 Mean validation score: 0.904 (std: 0.006)
Model with rank: 3 Mean validation score: 0.904 (std: 0.006)
Model with rank: 3 Mean validation score: 0.904 (std: 0.006)
0.8909164639091647
######### ExtraTreesClassifier (Ensamble) #########
RandomizedSearchCV took 34.32 seconds for 20 candidates parameter settings.
Model with rank: 1 Mean validation score: 0.881 (std: 0.002)
Model with rank: 2 Mean validation score: 0.848 (std: 0.001)
Model with rank: 3 Mean validation score: 0.847 (std: 0.000)
Model with rank: 3 Mean validation score: 0.847 (std: 0.000)
Model with rank: 3 Mean validation score: 0.847 (std: 0.000)
Model with rank: 3 Mean validation score: 0.847 (std: 0.000)
Model with rank: 3 Mean validation score: 0.847 (std: 0.000)
Model with rank: 3 Mean validation score: 0.847 (std: 0.000)
Model with rank: 3 Mean validation score: 0.847 (std: 0.000)
Model with rank: 3 Mean validation score: 0.847 (std: 0.000)
Model with rank: 3 Mean validation score: 0.847 (std: 0.000)
Model with rank: 3 Mean validation score: 0.847 (std: 0.000)
Model with rank: 3 Mean validation score: 0.847 (std: 0.000)
Model with rank: 3 Mean validation score: 0.847 (std: 0.000)
Model with rank: 3 Mean validation score: 0.847 (std: 0.000)
Model with rank: 3 Mean validation score: 0.847 (std: 0.000)
Model with rank: 3 Mean validation score: 0.847 (std: 0.000)
Model with rank: 3 Mean validation score: 0.847 (std: 0.000)
Model with rank: 3 Mean validation score: 0.847 (std: 0.000)
Model with rank: 3 Mean validation score: 0.847 (std: 0.000)
0.8795620437956204
######### AdaBoostClassifier (Ensamble) #########
RandomizedSearchCV took 229.07 seconds for 20 candidates parameter settings.
Model with rank: 1 Mean validation score: 0.895 (std: 0.005)
Model with rank: 1 Mean validation score: 0.895 (std: 0.005)
Model with rank: 3 Mean validation score: 0.895 (std: 0.005)
Model with rank: 3 Mean validation score: 0.895 (std: 0.005)
0.884022708840227
######### GradientBoostingClassifier (Ensamble) #########
RandomizedSearchCV took 853.20 seconds for 20 candidates parameter settings.
Model with rank: 1 Mean validation score: 0.908 (std: 0.004)
Model with rank: 2 Mean validation score: 0.907 (std: 0.005)
Model with rank: 3 Mean validation score: 0.905 (std: 0.006)
0.8941605839416058
######### GaussianNB (NaiveBayes) #########
RandomizedSearchCV took 0.09 seconds for 20 candidates parameter settings.
Model with rank: 1 Mean validation score: 0.808 (std: 0.010)
0.8061638280616383
######### LogisticRegression (linear) #########
RandomizedSearchCV took 122.76 seconds for 20 candidates parameter settings.
Model with rank: 1 Mean validation score: 0.886 (std: 0.002)
Model with rank: 2 Mean validation score: 0.886 (std: 0.003)
Model with rank: 3 Mean validation score: 0.885 (std: 0.002)
Model with rank: 3 Mean validation score: 0.885 (std: 0.002)
0.878345498783455
bestmodel
[0,
RandomForestClassifier(bootstrap=False, class_weight=None, criterion='gini',
max_depth=None, max_features=7, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=9, min_samples_split=5,
min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=8,
oob_score=False, random_state=132, verbose=0,
warm_start=True),
0.8913219789132197,
0.707395498392283,
0.6214689265536724]
dataframe_with_scores.head(10)
| Model | Precision | Accuracy | Balanced Accuracy | Recall | TPR | TNR | Fmeasure | Opti_MeanValidationScore | Opti_StdScore | Opti_TimeToHyP | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | RandomForestClassifier (Ensamble) | 0.707395 | 0.891322 | 0.5 | 0.554156 | 0.554156 | 0.445844 | 0.621469 | 0.908758 | 0.004691 | 36.184243 |
| 1 | BaggingClassifier (Ensamble) | 0.697531 | 0.890916 | 0.5 | 0.569270 | 0.569270 | 0.430730 | 0.626907 | 0.905109 | 0.006801 | 131.098444 |
| 2 | ExtraTreesClassifier (Ensamble) | 0.862319 | 0.879562 | 0.5 | 0.299748 | 0.299748 | 0.700252 | 0.444860 | 0.880880 | 0.002466 | 34.324217 |
| 3 | AdaBoostClassifier (Ensamble) | 0.659942 | 0.884023 | 0.5 | 0.576826 | 0.576826 | 0.423174 | 0.615591 | 0.895378 | 0.005171 | 229.074666 |
| 4 | GradientBoostingClassifier (Ensamble) | 0.757576 | 0.894161 | 0.5 | 0.503778 | 0.503778 | 0.496222 | 0.605144 | 0.907745 | 0.004395 | 853.200898 |
| 5 | GaussianNB (NaiveBayes) | 0.430052 | 0.806164 | 0.5 | 0.627204 | 0.627204 | 0.372796 | 0.510246 | 0.808495 | 0.010319 | 0.092752 |
| 6 | LogisticRegression (linear) | 0.741294 | 0.878345 | 0.5 | 0.375315 | 0.375315 | 0.624685 | 0.498328 | 0.886050 | 0.001544 | 122.762733 |
dataframe_with_scores.sort_values(by=['Precision','Fmeasure','Accuracy'], ascending=False).head(10)
| Model | Precision | Accuracy | Balanced Accuracy | Recall | TPR | TNR | Fmeasure | Opti_MeanValidationScore | Opti_StdScore | Opti_TimeToHyP | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | ExtraTreesClassifier (Ensamble) | 0.862319 | 0.879562 | 0.5 | 0.299748 | 0.299748 | 0.700252 | 0.444860 | 0.880880 | 0.002466 | 34.324217 |
| 4 | GradientBoostingClassifier (Ensamble) | 0.757576 | 0.894161 | 0.5 | 0.503778 | 0.503778 | 0.496222 | 0.605144 | 0.907745 | 0.004395 | 853.200898 |
| 6 | LogisticRegression (linear) | 0.741294 | 0.878345 | 0.5 | 0.375315 | 0.375315 | 0.624685 | 0.498328 | 0.886050 | 0.001544 | 122.762733 |
| 0 | RandomForestClassifier (Ensamble) | 0.707395 | 0.891322 | 0.5 | 0.554156 | 0.554156 | 0.445844 | 0.621469 | 0.908758 | 0.004691 | 36.184243 |
| 1 | BaggingClassifier (Ensamble) | 0.697531 | 0.890916 | 0.5 | 0.569270 | 0.569270 | 0.430730 | 0.626907 | 0.905109 | 0.006801 | 131.098444 |
| 3 | AdaBoostClassifier (Ensamble) | 0.659942 | 0.884023 | 0.5 | 0.576826 | 0.576826 | 0.423174 | 0.615591 | 0.895378 | 0.005171 | 229.074666 |
| 5 | GaussianNB (NaiveBayes) | 0.430052 | 0.806164 | 0.5 | 0.627204 | 0.627204 | 0.372796 | 0.510246 | 0.808495 | 0.010319 | 0.092752 |
dataframe_with_scores.sort_values(by=['Precision','Fmeasure','Accuracy'], ascending=False).iloc[0:4]
| Model | Precision | Accuracy | Balanced Accuracy | Recall | TPR | TNR | Fmeasure | Opti_MeanValidationScore | Opti_StdScore | Opti_TimeToHyP | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | ExtraTreesClassifier (Ensamble) | 0.862319 | 0.879562 | 0.5 | 0.299748 | 0.299748 | 0.700252 | 0.444860 | 0.880880 | 0.002466 | 34.324217 |
| 4 | GradientBoostingClassifier (Ensamble) | 0.757576 | 0.894161 | 0.5 | 0.503778 | 0.503778 | 0.496222 | 0.605144 | 0.907745 | 0.004395 | 853.200898 |
| 6 | LogisticRegression (linear) | 0.741294 | 0.878345 | 0.5 | 0.375315 | 0.375315 | 0.624685 | 0.498328 | 0.886050 | 0.001544 | 122.762733 |
| 0 | RandomForestClassifier (Ensamble) | 0.707395 | 0.891322 | 0.5 | 0.554156 | 0.554156 | 0.445844 | 0.621469 | 0.908758 | 0.004691 | 36.184243 |
dataframe_with_scores.sort_values(by=['Fmeasure'], ascending=False).iloc[0:10]
| Model | Precision | Accuracy | Balanced Accuracy | Recall | TPR | TNR | Fmeasure | Opti_MeanValidationScore | Opti_StdScore | Opti_TimeToHyP | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | BaggingClassifier (Ensamble) | 0.697531 | 0.890916 | 0.5 | 0.569270 | 0.569270 | 0.430730 | 0.626907 | 0.905109 | 0.006801 | 131.098444 |
| 0 | RandomForestClassifier (Ensamble) | 0.707395 | 0.891322 | 0.5 | 0.554156 | 0.554156 | 0.445844 | 0.621469 | 0.908758 | 0.004691 | 36.184243 |
| 3 | AdaBoostClassifier (Ensamble) | 0.659942 | 0.884023 | 0.5 | 0.576826 | 0.576826 | 0.423174 | 0.615591 | 0.895378 | 0.005171 | 229.074666 |
| 4 | GradientBoostingClassifier (Ensamble) | 0.757576 | 0.894161 | 0.5 | 0.503778 | 0.503778 | 0.496222 | 0.605144 | 0.907745 | 0.004395 | 853.200898 |
| 5 | GaussianNB (NaiveBayes) | 0.430052 | 0.806164 | 0.5 | 0.627204 | 0.627204 | 0.372796 | 0.510246 | 0.808495 | 0.010319 | 0.092752 |
| 6 | LogisticRegression (linear) | 0.741294 | 0.878345 | 0.5 | 0.375315 | 0.375315 | 0.624685 | 0.498328 | 0.886050 | 0.001544 | 122.762733 |
| 2 | ExtraTreesClassifier (Ensamble) | 0.862319 | 0.879562 | 0.5 | 0.299748 | 0.299748 | 0.700252 | 0.444860 | 0.880880 | 0.002466 | 34.324217 |
# re-initialise the arrays so the plot labels use the optimized models' AUCs
rf_roc_auc_array = []
fpr_array = []
tpr_array = []
thresholds_array = []
plt.figure(figsize=(12,6))
for i in range(len(my_models)):
    y_score = my_models[i].predict_proba(xnew_test)[:, 1]
    # AUC from probabilities; appending to the old array would mislabel the curves
    rf_roc_auc_array.append(roc_auc_score(ynew_test, y_score))
    fpr, tpr, thresholds = roc_curve(ynew_test, y_score)
    fpr_array.append(fpr)
    tpr_array.append(tpr)
    thresholds_array.append(thresholds)
    plt.plot(fpr, tpr, label=my_models_name[i]+' (area = %0.2f)' % rf_roc_auc_array[i])
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic - ROC graph (optimized models)')
plt.legend(loc="lower right")
plt.savefig('RF_ROC_opti3')
plt.show()
| Standard ModelConfig | Optimized ModelConfig |
|---|---|
| ![]() | ![]() |
SHAP (SHapley Additive exPlanations) is a unified approach to explain the output of any machine learning model. SHAP connects game theory with local explanations, uniting several previous methods [1-7] and representing the only possible consistent and locally accurate additive feature attribution method based on expectations (see our papers for details and citations).
Source: https://github.com/slundberg/shap
The individual months can be encoded with a LabelEncoder or, as done here, via one-hot encoding.
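One way to produce the `Month_*` dummy columns visible in `df_data1` below is `pd.get_dummies`; a minimal sketch on a hypothetical two-column frame:

```python
import pandas as pd

demo = pd.DataFrame({'Month': ['Feb', 'Mar', 'Feb', 'Nov'],
                     'PageValues': [0.0, 1.2, 0.0, 3.4]})

# One indicator column per month value, prefixed like the Month_* columns
encoded = pd.get_dummies(demo, columns=['Month'], prefix='Month')
print(list(encoded.columns))
```

Unlike a LabelEncoder (which maps months to arbitrary integers and imposes a false ordering), one-hot encoding keeps the categories independent, which suits tree ensembles and linear models alike.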
#RandomForest
# let's take a look at the shap values
# importing shap
import shap
explainer = shap.TreeExplainer(my_models[0])
shap_values = explainer.shap_values(x_test)
shap.summary_plot(shap_values[1], x_test, plot_type = 'bar')
shap.summary_plot(shap_values[1], x_test)
shap_values = explainer.shap_values(x_train.iloc[:50])
shap.initjs()
shap.force_plot(explainer.expected_value[1], shap_values[1], x_test.iloc[:50])
#ExtraTree
explainer = shap.TreeExplainer(my_models[2])
shap_values = explainer.shap_values(x_test)
shap.summary_plot(shap_values[1], x_test, plot_type = 'bar')
shap.summary_plot(shap_values[1], x_test)
shap_values = explainer.shap_values(x_train.iloc[:50])
shap.initjs()
shap.force_plot(explainer.expected_value[1], shap_values[1], x_test.iloc[:50])
df_data1.head(2)
| Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay | ... | Month_Aug | Month_Dec | Month_Feb | Month_Jul | Month_June | Month_Mar | Month_May | Month_Nov | Month_Oct | Month_Sep | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 0 | 0.0 | 1 | 0.0 | 0.2 | 0.2 | 0.0 | 0.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0.0 | 0 | 0.0 | 2 | 64.0 | 0.0 | 0.1 | 0.0 | 0.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 rows × 27 columns
# Q1: Time Spent by The Users on Website vs Bounce Rates
'''
Bounce Rate: the percentage of visitors to a particular website who navigate
away from the site after viewing only one page.
'''
# prepare the two features to cluster: Administrative_Duration and BounceRates
x = df.iloc[:, [1, 6]].values
# checking the shape of the dataset
x.shape
(12330, 2)
from sklearn.cluster import KMeans
# within-cluster sum of squares (WCSS) per k, used for the elbow method
wcss = []
for i in range(1, 11):
    km = KMeans(n_clusters=i,
                init='k-means++',
                max_iter=600,
                n_init=10,
                random_state=0,
                algorithm='elkan',
                tol=0.001)
    km.fit(x)
    wcss.append(km.inertia_)
plt.rcParams['figure.figsize'] = (15, 7)
plt.plot(range(1, 11), wcss)
plt.grid()
plt.tight_layout()
plt.title('The Elbow Method', fontsize = 20)
plt.xlabel('No. of Clusters')
plt.ylabel('wcss')
plt.show()
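The elbow is read off the plot by eye; a simple numeric heuristic is to pick the smallest k after which the relative WCSS improvement drops below a threshold. A sketch on a hypothetical WCSS curve (the `wcss` list computed above can be substituted):

```python
# hypothetical WCSS values for k = 1..10 (stand-in for the real `wcss` list)
wcss_example = [1000.0, 400.0, 150.0, 140.0, 135.0, 132.0, 130.0, 129.0, 128.0, 127.0]

def elbow_k(wcss, threshold=0.10):
    """Return the smallest k whose next split improves WCSS by less than `threshold`."""
    for k in range(1, len(wcss)):
        improvement = (wcss[k - 1] - wcss[k]) / wcss[k - 1]
        if improvement < threshold:
            return k
    return len(wcss)

k_elbow = elbow_k(wcss_example)  # -> 3
```

This matches the k = 3 chosen visually below; the threshold is an assumption and would need tuning per curve.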
km = KMeans(n_clusters = 3, init = 'k-means++', max_iter = 1000, n_init = 10, random_state = 0)
y_means = km.fit_predict(x)
plt.scatter(x[y_means == 0, 0], x[y_means == 0, 1], s = 100, c = 'green', label = 'Un-interested Customers')
plt.scatter(x[y_means == 1, 0], x[y_means == 1, 1], s = 100, c = 'yellow', label = 'General Customers')
plt.scatter(x[y_means == 2, 0], x[y_means == 2, 1], s = 100, c = 'red', label = 'Target Customers')
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], s = 50, c = 'blue', label = 'centroid')
plt.title('Administrative Duration vs Bounce Rate', fontsize = 20)
plt.grid()
plt.xlabel('Administrative Duration')
plt.ylabel('Bounce Rates')
plt.legend()
plt.show()
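The choice of k = 3 rests on the elbow plot alone; the silhouette score gives a numeric cross-check. A sketch on synthetic 2-D blobs (the real `x` from above can be substituted):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(0)
# three synthetic blobs standing in for the (duration, bounce-rate) points
pts = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in (0.0, 4.0, 8.0)])

# silhouette score for several candidate cluster counts
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(pts)
    scores[k] = silhouette_score(pts, labels)

best_k = max(scores, key=scores.get)
```

A score near 1 means tight, well-separated clusters; on the synthetic blobs the maximum lands at k = 3.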

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
grid = GridSearchCV(estimator=model,
                    param_grid={
                        'max_depth': [3, None],
                        'n_estimators': (10, 30, 50, 100, 200),  # 400, 600, 800, 1000
                        'max_features': (2, 4, 6)
                    },
                    cv=10, n_jobs=-1)
grid.fit(x_train, y_train)
print(grid)
# summarize the results of the grid search
print(grid.best_score_)
#print(grid.best_estimator_.alpha)
dfnew = pd.DataFrame(df)
df_data1 = pd.get_dummies(df)
df_data1.columns
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
dfnew['Revenue'] = le.fit_transform(dfnew['Revenue'])
dfnew['Revenue'].value_counts()
# dependent and independent variables
x = df_data1
# drop the target (Revenue) from the feature matrix
x = x.drop(['Revenue'], axis = 1)
y = dfnew['Revenue']
# check the dimensions
print("Shape of x:", x.shape)
print("Shape of y:", y.shape)
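The Revenue classes are imbalanced (far fewer buyers than non-buyers), so a stratified split keeps the class ratio identical in train and test. A self-contained sketch on a hypothetical imbalanced target (`X_demo`/`y_demo` are toy data, not the notebook's `x`/`y`):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# toy feature matrix and an imbalanced 0/1 target (~14% positives)
X_demo = pd.DataFrame({'f1': np.arange(1000), 'f2': np.arange(1000) % 7})
y_demo = pd.Series([1 if i % 7 == 0 else 0 for i in range(1000)])

# stratify=y_demo preserves the positive rate in both partitions
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42)
```

Without `stratify`, a random split can over- or under-represent the rare class in the test set and distort the evaluation.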
# helper that prints the top-n scores and returns the best parameters
def get_best_params_and_report(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                results['mean_test_score'][candidate],
                results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")
            if i == 1:
                bestparams = results['params'][candidate]
                bestscore = {'mean': results['mean_test_score'][candidate],
                             'std': results['std_test_score'][candidate]}
    return bestparams, bestscore
# fit the randomized search and report the best model
from time import time
from sklearn.model_selection import RandomizedSearchCV

n_iter_search = 20
random_search = RandomizedSearchCV(my_models[0], param_distributions=param_dist[0],
                                   n_iter=n_iter_search, cv=5)
start = time()  # start the clock before fitting
random_search.fit(x, y)
print("RandomizedSearchCV took %.2f seconds for %d candidate parameter settings." % ((time() - start), n_iter_search))
bestparams, bestscore = get_best_params_and_report(random_search.cv_results_)
print(bestparams['bootstrap'])
bestparams
random_search.best_estimator_
bestscore
from sklearn.cluster import SpectralClustering
from sklearn.cluster import DBSCAN
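The cell above only imports the two alternative clusterers. A minimal usage sketch on synthetic 2-D data (a stand-in for the scaled duration/bounce-rate pairs, not the notebook's `x`):

```python
import numpy as np
from sklearn.cluster import DBSCAN, SpectralClustering

rng = np.random.RandomState(0)
# two synthetic blobs standing in for the feature pairs
pts = np.vstack([rng.normal(0.0, 0.2, size=(40, 2)),
                 rng.normal(3.0, 0.2, size=(40, 2))])

# DBSCAN: density-based, needs no cluster count, labels outliers as -1
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(pts)

# SpectralClustering: graph-based, the cluster count must be given
sc_labels = SpectralClustering(n_clusters=2, random_state=0,
                               assign_labels='discretize').fit_predict(pts)
```

Unlike KMeans, DBSCAN handles non-spherical clusters and flags noise points, which may suit the heavily skewed duration features better; `eps` and `min_samples` would need tuning on the real data.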